Simplifying the Design of Workflows for Large-Scale Data Exploration and Visualization

Authors

  • Juliana Freire
  • Cláudio T. Silva
Abstract

Computing has been an enormous accelerator to science and has led to an information explosion in many different fields. To analyze and understand scientific data, complex computational processes must be assembled, often requiring the combination of loosely coupled resources, specialized libraries, and grid and Web services. Workflow systems have therefore grown in popularity within the scientific community [2]. Not only do workflows support the automation of repetitive tasks, but they can also capture complex analysis processes at various levels of detail and systematically record provenance information for the derived data products [1, 3]. The provenance (also referred to as the audit trail, lineage, or pedigree) of a data product contains information about the process and data used to derive it. It provides important documentation that is key to preserving the data, to determining the data’s quality and authorship, and to reproducing and validating the results. These are all important requirements of the scientific process. Applying traditional workflow technology to exploratory tasks, however, brings new challenges. Whereas business workflows are primarily used to automate repetitive processes, scientific workflows are often used in exploratory tasks, where change is the norm. Furthermore, scientific workflows need to cater to a broader set of users, including many who do not have programming expertise. Even for systems that have sophisticated visual programming interfaces, the path from raw data to insight is laborious and error-prone. Visual programming interfaces expose computational components (functions) as modules and allow the creation of complex pipelines that combine these modules in a workflow, where data flows along the connections between modules.
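The dataflow model described above can be sketched concretely. The following is a minimal illustration (not VisTrails' actual API; all class and module names here are hypothetical) of modules with typed ports, where data flows along explicit connections and a connection between incompatible ports is rejected:

```python
# Hypothetical sketch of a dataflow pipeline: modules expose typed input
# and output ports; data flows along connections between compatible ports.

class Module:
    def __init__(self, name, inputs, outputs):
        self.name = name          # e.g. "CSVReader"
        self.inputs = inputs      # port name -> port type
        self.outputs = outputs    # port name -> port type

class Pipeline:
    def __init__(self):
        self.modules = []
        self.connections = []     # (src, out_port, dst, in_port) tuples

    def add(self, module):
        self.modules.append(module)
        return module

    def connect(self, src, out_port, dst, in_port):
        # Built-in constraint check: disallow connections between
        # incompatible module ports.
        if src.outputs[out_port] != dst.inputs[in_port]:
            raise TypeError(
                f"cannot connect {src.name}.{out_port} "
                f"({src.outputs[out_port]}) to {dst.name}.{in_port} "
                f"({dst.inputs[in_port]})")
        self.connections.append((src, out_port, dst, in_port))

# Usage: data flows from a reader through a filter into a renderer.
p = Pipeline()
reader = p.add(Module("CSVReader", {}, {"table": "Table"}))
filt = p.add(Module("IsoSurface", {"table": "Table"},
                    {"geometry": "Geometry"}))
vis = p.add(Module("Renderer", {"geometry": "Geometry"}, {}))
p.connect(reader, "table", filt, "table")
p.connect(filt, "geometry", vis, "geometry")
# p.connect(reader, "table", vis, "geometry")  # would raise TypeError
```

The constraint check catches structural mistakes early, but, as the abstract notes, it says nothing about *which* modules a user should add next; that gap motivates the recommendation mechanisms discussed below.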
These interfaces ease the creation of pipelines through a simple programming model (dataflows) and through built-in constraint-checking mechanisms (e.g., disallowing a connection between incompatible module ports). Nevertheless, without detailed knowledge of the underlying computational components, it is difficult to understand which series of modules and connections ought to be added to obtain a desired result. In essence, there is no “roadmap”: systems provide very little feedback to help the user figure out which modules can or should be added to the pipeline. This problem is compounded by the fact that analysis pipelines contain many disparate components: data needs to be gathered, generated, integrated, transformed, and visualized. A novice user, or even an advanced user performing a new task, often resorts to manually searching for existing pipelines to use as examples. These examples are then adapted and iteratively refined until a solution is found. Unfortunately, this manual, time-consuming process is currently the rule for creating pipelines rather than the exception. In this paper, we describe an infrastructure we developed to support the design of large-scale exploratory computational tasks, and in particular of data analysis through visualization. The infrastructure includes a set of scalable tools and intuitive interfaces that allow casual users to explore and re-use the knowledge embedded in pipeline specifications. By querying the task specifications, users can learn by example from the reasoning and analysis strategies of experts, expedite their training, and potentially reduce the time lag between data acquisition and insight. We also discuss how information in collections of pipeline specifications can be explored to simplify the creation and refinement of pipelines. In particular, we show how this information can be used to build a recommendation system and a mechanism for modifying pipelines by analogy.
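One way to mine a collection of pipeline specifications for recommendations, sketched below under simplifying assumptions (linear pipelines represented as module-name sequences; this is an illustration of the general idea, not the paper's actual algorithm), is to rank modules by how often they follow the user's current module in past pipelines:

```python
from collections import Counter

def suggest_next(pipelines, current_module, top_k=2):
    """Rank modules by how often they directly follow `current_module`
    across a collection of past pipeline specifications."""
    follows = Counter()
    for pipeline in pipelines:
        for a, b in zip(pipeline, pipeline[1:]):
            if a == current_module:
                follows[b] += 1
    return [m for m, _ in follows.most_common(top_k)]

# A toy collection of linear pipelines (hypothetical module names).
collection = [
    ["CSVReader", "Contour", "Renderer"],
    ["CSVReader", "Contour", "Smooth", "Renderer"],
    ["CSVReader", "Histogram"],
]

print(suggest_next(collection, "CSVReader"))  # ['Contour', 'Histogram']
```

Even this crude frequency count lets a novice benefit from the accumulated choices of experts; richer variants could match larger pipeline fragments, which is the intuition behind completing and modifying pipelines by analogy.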
In our presentation, we will give a live demonstration of this infrastructure that has been implemented in the VisTrails system (http://www.vistrails.org). We will also discuss current efforts to use this infrastructure in a workflow collaboratory that supports social analysis of scientific data [4].


Similar resources

Using Provenance to Streamline Data Exploration through Visualization

Scientists are faced with increasingly larger volumes of data to analyze. To analyze and validate various hypotheses, they need to create insightful visual representations of both observed data and simulated processes. Often, insight comes from comparing multiple visualizations. But data exploration through visualization requires scientists to assemble complex workflows—pipelines consisting of ...


Epiviz: Integrative Visual Analysis Software for Genomics

Title of Dissertation: Epiviz: INTEGRATIVE VISUAL ANALYSIS SOFTWARE FOR GENOMICS Florin Chelaru, Doctor of Philosophy, 2015 Directed By: Professor Héctor Corrada Bravo Department of Computer Science Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous consist of genome browsers, focused mainly on integr...


Estimating the Conditional Survival Function of a Failure Time Given a Time-Varying Covariate with Interval-Censored Observations

In this paper, we propose an approach for the nonparametric estimation of the conditional survival function of a time to failure given a time-varying covariate under interval-censoring for the failure time. Our strategy consists in modeling the covariate path with a random effects model, as is done in the degradation and joint longitudinal and survival data modeling...


A Fuzzy Decision-Making Methodology for Risk Response Planning in Large-Scale Projects

Risk response planning is one of the main phases in the project risk management and has major impacts on the success of a large-scale project. Since projects are unique, and risks are dynamic through the life of the projects, it is necessary to formulate responses of the important risks. The conventional approaches tend to be less effective in dealing with the impreciseness of risk response p...


Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in dataset that is currently being produced at high speed, large volumes and with a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing the huge amount of data, today has become one of the concerns of data science scholar...


Visualization: Insight on Your Work

Visualization is an important part of numerous simulation workflows. It helps users intuitively discover artifacts in their data, because it directly makes use of one of humans’ fundamental senses: vision. Also, it does not require as much effort as raw data analysis. There are several usable workflows depending on current needs, each one using a different visualization approach. With the incr...




Publication date: 2008